Signed Distance Correlation (SiDCo): an online implementation of distance correlation and partial distance correlation for data-driven network analysis

Abstract

Motivation
There is a need for easily accessible implementations that measure the strength of both linear and non-linear relationships between metabolites in biological systems as an approach for data-driven network development. While multiple tools implement linear Pearson and Spearman methods, there are no such tools that assess distance correlation.

Results
We present here SIgned Distance COrrelation (SiDCo). SiDCo is a GUI platform for calculation of distance correlation in omics data, measuring linear and non-linear dependencies between variables, as well as correlation between vectors of different lengths, e.g. different sample sizes. By combining the sign of the overall trend from Pearson's correlation with distance correlation values, we further provide a novel "signed distance correlation" of particular use in metabolomic and lipidomic analyses. Distance correlations can be selected as one-to-one or one-to-all correlations, showing relationships between each feature and all other features one at a time or in combination. Additionally, we implement "partial distance correlation," calculated using the Gaussian Graphical Model approach adapted to distance covariance. Our platform provides an easy-to-use software implementation that can be applied to the investigation of any dataset.

Availability and implementation
The SiDCo software application is freely available at https://complimet.ca/sidco. Supplementary help pages are provided at https://complimet.ca/sidco. Supplementary Material shows an example of an application of SiDCo in metabolomics.


Introduction
The analysis of biological networks, as a parallel investigation to the study of individual feature characteristics, requires robust quantification of the interconnections between features within biological systems (Ma'ayan 2011). Several methods for data-driven determination of feature interconnections have been used in the analysis of metabolomic data. Pearson and Spearman correlation-based methods are arguably the most prevalent (Amara et al. 2022). While they provide critical information about the direction of dependencies, both methods measure only linear or monotonic correlations and cannot detect other non-linear feature interactions (Rosato et al. 2018 2021). In metabolomic and lipidomic datasets, distance correlation can accommodate the sparse coverage of feature data, detect non-linear relationships, and handle the possibly random network topologies associated with metabolism and inherent to such datasets, with zero correlation obtained only for fully independent features. Despite these advantages, few publications have used distance correlation to analyze metabolomic data (Oliveira et al. 2015; Tang et al. 2019; Cuperlovic-Culf et al. 2021). We suggest that this is, in part, due to the lack of easily accessible implementations. Moreover, to our knowledge, no comparable implementation allows users to assess partial correlations, which measure the association between pairs of features while removing the confounding effects of other variables. To address this need and thereby provide new methods for the reconstruction of regulatory metabolomic and lipidomic networks, we present here SIgned Distance COrrelation (SiDCo), a web-based application for both signed distance correlation and partial distance correlation, the latter computed with the Gaussian Graphical Model (GGM) approach previously used for other correlation measures (Lauritzen 1996).

Implementation
SiDCo is implemented in Python with an RShiny front-end. It is compatible with all web browsers. Two analytical tabs allow users to perform either distance correlation or partial distance correlation. In both applications, users define their desired threshold values and P values. Data are automatically z-score normalized across all samples prior to analysis. Missing values must be imputed by the user, according to their own specifications, before analysis; otherwise the data are returned with a descriptive error message.
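The preprocessing described above can be sketched as follows. This is a minimal illustration and not the SiDCo source code: the function name `zscore_features`, the samples-by-features layout, and the use of the sample standard deviation (`ddof=1`) are assumptions made for the example.

```python
import numpy as np

def zscore_features(data):
    """Z-score each feature (column) across all samples (rows).

    Hypothetical helper: rejects missing values instead of imputing them,
    mirroring the requirement that imputation is done by the user.
    """
    data = np.asarray(data, dtype=float)
    if np.isnan(data).any():
        raise ValueError("Missing values found: impute them before analysis.")
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)  # assumption: sample standard deviation
    if np.any(std == 0):
        raise ValueError("Constant feature found: z-score is undefined.")
    return (data - mean) / std
```

After this step every feature has mean 0 and unit standard deviation, so features measured on very different intensity scales contribute comparably to the Euclidean distances used later.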
Distance correlations and P values are calculated and presented as described below, and a correlation directionality sign is derived from Pearson correlation analysis as an indication of the overall linear trend in the data. Distance correlation calculations in SiDCo are provided in three forms: (i) "one-to-one," calculating correlations between each pair of features, (ii) "one-to-all," providing correlations for each feature with all other features combined, and (iii) partial distance correlation, calculated for each pair of features while controlling for the contributions of other features, i.e. covariates.
Distance correlation, $\mathrm{dCor}(X,Y)$, between features $X$ and $Y$ and distance covariance, $\mathrm{dCov}(X,Y)$, are calculated from the standard sample definitions as:

$$\mathrm{dCor}(X,Y) = \frac{\mathrm{dCov}(X,Y)}{\sqrt{\mathrm{dVar}(X)\,\mathrm{dVar}(Y)}}, \qquad \mathrm{dCov}(X,Y) = \frac{1}{n}\sqrt{\sum_{j=1}^{n}\sum_{k=1}^{n} A_{j,k}\,B_{j,k}},$$

where $A$ and $B$ represent doubly centered distance matrices for variables $X$ and $Y$, respectively, measured in $n$ samples. Distance variances (dVar) are:

$$\mathrm{dVar}(X) = \frac{1}{n}\sqrt{\sum_{j=1}^{n}\sum_{k=1}^{n} A_{j,k}^{2}}, \qquad \mathrm{dVar}(Y) = \frac{1}{n}\sqrt{\sum_{j=1}^{n}\sum_{k=1}^{n} B_{j,k}^{2}}.$$

In a one-to-one correlation calculation, an array of values for each feature is compared with an array of values for all the other features one at a time. In this case, the doubly centered distance matrices are calculated as:

$$A_{j,k} = a_{j,k} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot} \qquad \text{and} \qquad B_{j,k} = b_{j,k} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot},$$

where the Euclidean distance from $x_j$ to $x_k$ and from $y_j$ to $y_k$ is calculated as

$$a_{j,k} = \sqrt{(x_j - x_k)(x_j - x_k)'} \qquad \text{and} \qquad b_{j,k} = \sqrt{(y_j - y_k)(y_j - y_k)'}.$$

Here $\bar{a}_{j\cdot}$, $\bar{b}_{j\cdot}$ and $\bar{a}_{\cdot k}$, $\bar{b}_{\cdot k}$ are the $j$-row and $k$-column mean values, and $\bar{a}_{\cdot\cdot}$, $\bar{b}_{\cdot\cdot}$ are the overall means of the pairwise distance matrices $a$ and $b$.
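The calculation above can be illustrated with a short NumPy sketch. It uses the standard sample (V-statistic) definitions of distance covariance and distance variance; the function names are hypothetical and the code is not taken from SiDCo itself.

```python
import numpy as np

def centered_distances(x):
    """Doubly centered Euclidean distance matrix for samples in rows."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # a single feature becomes an (n, 1) array
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise distances a_{j,k}
    # subtract row means and column means, add back the overall mean
    return (d - d.mean(axis=1, keepdims=True)
              - d.mean(axis=0, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Sample distance correlation between x and y (same number of samples)."""
    A = centered_distances(x)
    B = centered_distances(y)
    dcov2 = (A * B).mean()    # squared distance covariance
    dvar2_x = (A * A).mean()  # squared distance variance of x
    dvar2_y = (B * B).mean()  # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0  # convention when either variable is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))
```

For example, with `x = np.linspace(-1, 1, 21)` and `y = x ** 2`, the Pearson correlation is essentially zero while `distance_correlation(x, y)` is clearly positive, which is the kind of non-linear dependence the measure is meant to capture. Because `centered_distances` accepts arrays with several columns, the same function also accepts a multi-feature `y`.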
In a one-to-all case, the distance covariance for each feature out of $m$ features in $n$-dimensional sample space is compared to that of all the other features in $n \times (m-1)$-dimensional space. The distance matrix for variable $Y$, which is doubly centered and used in the calculation of dCov, is here:

$$b_{j,k} = \sqrt{\sum_{s=1}^{m-1} (y_{js} - y_{ks})(y_{js} - y_{ks})'}$$

and equivalently $a_{j,k}$ for $X$. The distance correlation P value is calculated using the Student's t cumulative distribution function with the t value calculated as:

$$t(X,Y) = \frac{\mathrm{dCor}(X,Y)\,\sqrt{n-2}}{\sqrt{1 - \mathrm{dCor}(X,Y)^{2}}}$$

and the corresponding two-sided P value for the t-distribution with $n-2$ degrees of freedom. The sign of the distance correlation is given by the sign of the Pearson correlation, following Pardo-Diaz et al. (2021). The final output is provided as an .xlsx file and includes distance correlation and corresponding P values. Output of one-to-one analysis also includes the Pearson and Spearman correlations and their corresponding P values for completeness.
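The P value and the sign assignment described above can be sketched as follows, assuming a distance correlation value has already been computed. The function names are hypothetical, `scipy.stats.t.sf` is used for the upper tail of the t-distribution, and the sign helper covers only the one-to-one case of two single features.

```python
import numpy as np
from scipy import stats

def dcor_p_value(dcor, n):
    """Two-sided P value from t = dCor * sqrt(n - 2) / sqrt(1 - dCor^2)."""
    if dcor >= 1.0:
        return 0.0  # perfect dependence: the t value diverges
    t_value = dcor * np.sqrt(n - 2) / np.sqrt(1.0 - dcor ** 2)
    # two-sided tail probability with n - 2 degrees of freedom
    return float(2.0 * stats.t.sf(abs(t_value), df=n - 2))

def signed_dcor(dcor, x, y):
    """Attach the sign of the Pearson correlation of x and y to dCor."""
    r = np.corrcoef(x, y)[0, 1]
    return float(np.copysign(dcor, r))
```

The magnitude therefore always comes from the distance correlation, and the Pearson coefficient contributes only the direction of the overall linear trend.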
Partial correlation, the correlation between two features corrected for contributions of other features, is calculated as (following GGM):

$$\rho_{ij} = -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}},$$

where the matrix $\omega = R^{-1}$ is the inverse of $R$, the distance covariance matrix. The inverse of the distance covariance matrix is computed with the Moore-Penrose pseudo-inverse, which is equivalent to standard inversion for non-singular square matrices and provides a generalized inverse for singular matrices, for which a standard inverse does not exist. Partial distance correlation calculation should only be performed when the number of features is smaller than the number of samples. Here P values are calculated using the Fisher z-transformed correlation values:

$$z_{ij} = 0.5\,\log\frac{1+\rho_{ij}}{1-\rho_{ij}}$$

and the cumulative standard normal distribution (cdtf) function:

$$p_{i,j} = 2\left(1 - \mathrm{cdtf}\!\left(z_{ij}\sqrt{N - M - 1}\right)\right),$$

where $N$ is the number of samples and $M$ is the number of features. Results are provided as .xlsx downloads.
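A minimal sketch of this step is given below, assuming the feature-by-feature matrix R has already been assembled. The function names are hypothetical. Reading N as the number of samples and M as the number of features is an interpretation of the formula, and taking the absolute value of z (so that P values stay between 0 and 1 for negative partial correlations) is an assumption of this sketch rather than something stated in the text.

```python
import math
import numpy as np

def partial_correlations(R):
    """GGM-style partial correlations from a symmetric matrix R."""
    omega = np.linalg.pinv(R)  # Moore-Penrose pseudo-inverse
    scale = np.sqrt(np.diag(omega))
    rho = -omega / np.outer(scale, scale)
    np.fill_diagonal(rho, 1.0)
    return rho

def fisher_p_values(rho, n_samples):
    """Two-sided P values from Fisher z-transformed partial correlations."""
    m = rho.shape[0]  # number of features (assumed meaning of M)
    off_diag = ~np.eye(m, dtype=bool)
    z = np.zeros_like(rho)
    # arctanh(r) equals 0.5 * log((1 + r) / (1 - r)); clip avoids infinities
    z[off_diag] = np.arctanh(np.clip(rho[off_diag], -1 + 1e-12, 1 - 1e-12))
    stat = np.abs(z) * math.sqrt(n_samples - m - 1)  # assumption: |z|
    # 2 * (1 - Phi(s)) equals erfc(s / sqrt(2)) for the standard normal
    p = np.vectorize(math.erfc)(stat / math.sqrt(2.0))
    np.fill_diagonal(p, np.nan)  # no test for a feature with itself
    return p
```

As a quick check, for `R = [[1, 0.5, 0.25], [0.5, 1, 0.5], [0.25, 0.5, 1]]` the first and third features are conditionally independent given the second, so their partial correlation is zero while the neighbouring pairs keep a non-zero value.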

Conclusion
SiDCo is an open-access web-based application for the calculation of signed and partial distance correlations between features, available at https://complimet.ca/SiDCo, where detailed instructions are provided.

Supplementary data
Supplementary data are available at Bioinformatics online.